-
Notifications
You must be signed in to change notification settings - Fork 670
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
reimplement sendmmsg/recvmmsg with better API #1744
Conversation
7f5ccbc
to
9e80118
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @pacak! Overall the changes look very reasonable, although I've not done an entirely thorough review just yet.
A few requests:
- IMHO it makes sense to make a similar change for
sendmmsg
in the same release as this PR, to avoid splitting up the breaking changes into multiple releases (thus prolonging any associated pains). Would you be willing to make a similar change forsendmmsg
? - Do you have an example of retrofitting an existing caller to use the new API you can share? In places where we make nontrivial API changes, I'd like to maintain a list of example changes to help other clients migrate.
I can but it would take some time. It would be great if #1748 was merged before that. Currently making any changes to
Sort of. I updated the tests in |
Fair enough. The alternative would be to just disable the auto-formatting, but I agree that's annoying to do for one repo (when the proper fix is apparent). I've been reluctant to submit any bulk reformatting changes up until now to avoid causing merge conflicts for all outstanding PRs, but I think we've reached a cross-over point where most PRs already have merge conflicts anyways so we should just do it. @asomers thoughts? |
Yeah, I think it's time to do it. I for one don't have any big outstanding PRs. |
276e8a3
to
90e4d53
Compare
Ended up implementing it. Tests seem to work. Don't have use cases in real code so can't test it properly. |
@pacak can you rebase and reformat? Thanks! |
Seems to work. |
Something seems to be wrong here. This testcase: https://gist.github.com/rroohhh/acb6247f7634e4e68d53604c4c3f88b1
and in release mode with
|
Interesting. Moving buffers allocation out from lines 33 and 45 fixes the problem for me, looks like something holds an invalid pointer in both cases... Thank you for a great test case, I'll see how to fix the problem tomorrow. |
A good test case. A temporary array created on lines 33/45 used to be stored inside a pointer in |
Looks like nightly clippy introduced some new warnings. I'll look at fixing them in the next day or two. |
@pacak thank you for the swift response and fix |
@pacak apologies for the delay in responding to this. I've been trying to track down any potentially impacted clients to verify that this new solution fully meets their needs, but had previously been unsuccessful. Fortunately I've just found two. @amatissart your @nullchinchilla same as above, except with the It'd be greatly appreciated if you're willing to help out, but there's no obligation. Thank you both either way. 😃 |
0588088
to
076aee7
Compare
Yea, looks like we are getting the data but not the control header. Things to try in no particular order:
|
Rebased the branch, backported the test in #1837 |
@rtzoeller Can you try running both this and backported tests on your Pi? |
bors try |
tryMerge conflict. |
ubuntu@ip-10-90-3-99:~/nix$ cat /proc/cpuinfo | head -8
processor : 0
BogoMIPS : 243.75
Features : fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp ssbs
CPU implementer : 0x41
CPU architecture: 8
CPU variant : 0x3
CPU part : 0xd0c
CPU revision : 1 ubuntu@ip-10-90-3-99:~/nix$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 22.04.1 LTS
Release: 22.04
Codename: jammy ubuntu@ip-10-90-3-99:~/nix$ rustc -vV
rustc 1.56.1 (59eed8a2a 2021-11-01)
binary: rustc
commit-hash: 59eed8a2aac0230a8b53e89d4e99d55912ba6b35
commit-date: 2021-11-01
host: aarch64-unknown-linux-gnu
release: 1.56.1
LLVM version: 13.0.0 ubuntu@ip-10-90-3-99:~/nix$ for i in `seq 1 10000` ;do cargo test test_recvmm2 --lib ; done 2> /dev/null > a
ubuntu@ip-10-90-3-99:~/nix$ cat a | sort | uniq -c
30000
10000 running 1 test
10000 test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured; 60 filtered out; finished in 0.00s
10000 test sys::socket::test::test_recvmm2 ... ok
Something tells me the problem is in the image or container CI uses. |
Also works on a bare metal instance. |
I would agree, except it is reproducing on the physical hardware in front me of me as well. 😭 Tired of being the bottleneck here, I spent some additional time on this today with the intent of finding a root cause. I had the idea today of using The only problem with that strategy is that using Which leads me to conclude that there is probably some sort of race condition going on, but that it's probably outside the scope of this code (i.e. in the kernel) and that it's only hitting very slow devices. It reproduces on the CI, and on my RPi3 with both the 5.17 and 5.19 kernels on Fedora 36, but I wouldn't expect either to be particularly quick. Alternatively, there is a bug in this code, but it's sufficiently subtle that we are unlikely to characterize it here without additional reports from other users. For good measure, I confirmed that the bug also reproduces on my hardware under the @pacak thank you for your effort and patience on this. Can you disable |
It could be related to the kernel version, I used Ubuntu 22.04 - latest LTS which uses 5.15. Also not shown in what I posted but I did the same test on 4 different instances - starting from the fast 64 core bare metal ones and down to slow dual core virtual one.
Will do. The test itself mostly works on aarch64, I'll just disable the assert about the timestamps on that platform. Do you think it's worth mentioning a potential problem on aarch64 in the docs? Or at least in comments? Or the conditions are rare enough so it's unlikely? It needs to be a combination of aarch64, slow hardware, |
CMSG_FIRSTHDR/CMSG_NEXTHDR operate in terms of pointers contained inside msghdr structure, vector capacity doesn't matter for them. This would change external behavior of recvmsg/recvmmsg in a sense that buffer passed to store controll messages won't have it's length updated but intended way to receive control messages is with cmsgs iterator on `RecvMsg` which would still work. This change is required to allow using a single vector to store control messages from multiple packets
This is already an unsafe function, dealing with pointers directly does not make it much more unsafe but simplifies lifetimes later on and similar to a previous commit allows to alocate a single buffer to store all the control messages
We'll be using that to reinitialize buffers later
New implementation performs no allocations after all the necessary structures are created, removes potentially unsound code that was used by the old version (see below) and adds a bit more documentation about bugs in how timeout is actually handled ``` let timeout = if let Some(mut t) = timeout { t.as_mut() as *mut libc::timespec } else { ptr::null_mut() }; ```
See nix-rust#1744 for more details
See nix-rust#1744 for more details
A comment would be a good idea - I don't think we need to call it out in the public docs. |
Added a comment. Also if we are talking race conditions - backported tests might be passing on the old version of |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
bors r+
Hmmm.... And we can test this theory by adding a sleep or some redundant allocations in the new test between calling recvmmsg and using the results - line 1847 or so. Well, you can because I can't reproduce the problem on my side. |
Motivation is explained in #1602, new version allows to receive data without performing allocations inside the receive loop and to use received data without extra copying.
This pull request contains a breaking change to API
recvmmsg
(obviously) and also affectsrecvmsg
- new version does not set length of control message buffer if one is passed. Later change can be avoided with a bit more copy-paste.Fixes #1602